Method and apparatus for classifying and ranking interpretations for multimodal input fusion

ABSTRACT

A method used in an electronic equipment (100) generates a set of joint multimodal interpretations (125) from a set of multimodal interpretations (115) generated by one or more modalities (105) during a turn, generates a set of integrated multimodal interpretations (135) including an integrated multimodal interpretation formed from each joint multimodal interpretation by unifying the type feature structure of each multimodal interpretation in the joint multimodal interpretation, and generates a multilevel confidence score for each integrated multimodal interpretation based on at least one of a context score, a content score, and a confidence score of the integrated multimodal interpretation. The method classifies the multimodal interpretations and generates a set of joint multimodal interpretations that comprises essentially all possible joint multimodal interpretations. The multilevel confidence scoring is based on up to eleven factors, which provides for an accurate ranking of the integrated multimodal interpretations.

FIELD OF THE INVENTION

This invention is in the field of software and more specifically is in the field of software that interprets inputs that have been received nearly simultaneously by a plurality of input modalities.

BACKGROUND

In systems that use multimodal inputs, such as simultaneous speech, writing, and gesturing, for operating software applications, each unimodal input is typically time segmented and recognized by specialized input functions such as speech recognition, word processing, image recognition, and touch detection functions, which produce individual multimodal interpretations. The time segments may be called turns. Each multimodal interpretation is characterized by being identified to a modality (i.e., the identity of the input and recognizer), being given a multimodal type and a confidence score, and values for a set of attributes associated with the multimodal type and modality are generated. The set of information that includes the identification of the modality, the multimodal type, the confidence score, and the attribute values is sometimes called a type feature structure. In some instances, the recognition function generates a plurality of multimodal interpretations from one input. For example, a gesture that points to a map may be interpreted as identifying the region of the map, or a hotel that is on the map. In such instances, the recognizer generates a set (in this example, two) of ambiguous multimodal interpretations, each of which typically has a lower confidence score than when one multimodal interpretation is generated. In some instances, the recognition function can generate a plurality of multimodal interpretations from one input that result from sequential actions (e.g., two gestures may be made during one turn). Such multimodal interpretations are independent (not ambiguous).

The multimodal interpretations generated during a time segment are then analyzed as a set to determine a most probable meaning of them when interpreted together. One or more joint multimodal interpretations are generated and a unified type feature structure is generated for each joint multimodal interpretation. An application then uses the unified type feature structure as an input for the application.

In some reported implementations, such as that described in “Multimodal Interfaces That Process What Comes Naturally”, by Sharon Oviatt and Philip Cohen, Communications of the ACM, March 2000, Vol. 43, No. 3, when an ambiguous set of multimodal interpretations is generated, combinations are formed using different members of the set of ambiguous multimodal interpretations and the confidence scores of each multimodal interpretation are evaluated in a variety of ways to select a top-ranked joint multimodal interpretation to send to “the system's ‘application bridge’ agent, which confirms the interpretation with the user and sends it to the appropriate backend application.” This approach is inappropriate because the selection of the top-ranked joint multimodal interpretation is not sufficiently reliable for using it without user confirmation, and this approach obviously slows down the speed of input.

Another limitation of some reported implementations is that there is no proposed mechanism for handling independent, non-ambiguous multimodal interpretations from one modality in one turn; the durations of turns have to be managed to avoid independent, non-ambiguous multimodal interpretations from one modality in one turn, and when such management fails, unreliable joint multimodal interpretations result.

What is needed is a more comprehensive and reliable approach for handling multimodal inputs so as to generate better information to pass on to applications.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the accompanying figures, in which like references indicate similar elements, and in which:

FIG. 1 is a block diagram that shows functions of an electronic equipment for accomplishing multimodal fusion of inputs presented to the electronic equipment, in accordance with some embodiments of the present invention;

FIGS. 2-3 are representations of data structures that are used during the fusion of multimodal interpretations, in accordance with some embodiments of the present invention;

FIGS. 4-9 show data structures and flow charts of a method used by a semantic classifier of the electronic equipment for generating joint multimodal interpretations, in accordance with an example of some embodiments of the present invention; and

FIGS. 10-13 show flow charts that illustrate a method used by an interaction manager of the electronic equipment to generate a multilevel confidence score for each integrated multimodal interpretation, in accordance with some embodiments of the present invention.

Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.

DETAILED DESCRIPTION OF THE DRAWINGS

Before describing in detail the particular multimodal fusion method and apparatus in accordance with the present invention, it should be observed that the present invention resides primarily in combinations of method steps and apparatus components related to multimodal fusion technology. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.

Referring to FIG. 1, a block diagram shows functions of electronic equipment 100 that includes apparatus and methods for accomplishing multimodal fusion of inputs presented to the electronic equipment 100, in accordance with some embodiments of the present invention. The electronic equipment 100 comprises input apparatuses 105 of differing types, called modalities, that process input actions that have typically been made by humans. An example of a typical combination of modalities 105 that the electronic equipment 100 might have is a microphone, a touch sensitive display screen, a camera (for capturing gestures), and key switches. The inputs from the modalities 105 are rendered into electronic signals 106 by each modality, which are time segmented and recognized by functions within the segmentation and recognition function 110. The operation of the modalities 105 and the segmentation and recognition functions 110 are well understood by those of ordinary skill in the art, and for the example mentioned would typically include a voice recognition function for the microphone modality, a handwriting interpreter for the touch screen modality, a gesture interpreter for the camera modality, and a command recognizer for the key switches. The segmentation and recognition function 110 generates a set of multimodal interpretations 115 for each time segment, called a turn, of a sequence of time segments. The time segment for a turn may overlap the time segment for a subsequent or previous turn. Each multimodal interpretation (MMI) in the set of multimodal interpretations 115 is typically a unimodal interpretation; that is, each is an interpretation from one modality, but the MMIs are so called herein because they may be generated by any of a plurality of modalities. Each MMI is typically rendered as one type of a set of possible types. Many types can be classified as tasks. Thus, a speech input of “create a route from here to there” might be interpreted as being a part of a “create route” task or an “identify a hotel” (that is visible on a map) task. Each MMI type is typically associated with a defined set of attributes. An MMI of one type may be generated by more than one modality. The segmentation and recognition function 110 typically generates each MMI with an associated confidence score that indicates a confidence that the MMI type has been accurately determined, and may also generate the value of one or more attributes of the MMI. The MMI type, the modality that generated the MMI (the MMI modality), identification of the turn in which the MMI was generated, and the defined set of attributes, including any known values, are stored and represented as a set of information called a type feature structure (TFS). It is not infrequent that the segmentation and recognition function 110 will render several multimodal interpretations from one modality in a turn. Some of these may be alternative interpretations derived from segments of the electronic signal that have been received from one modality in a turn, for which the segments of the electronic signal from which the alternative MMIs are derived are substantially overlapping. In this situation, these alternative interpretations are described as a subset of ambiguous MMIs. There may be multiple subsets of such ambiguous MMIs, typically generated by more than one modality. The MMI types of MMIs in a subset of ambiguous MMIs may differ from each other.
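
By way of illustration only, the type feature structure described above could be modeled with a simple data structure such as the sketch below; the field names, and the encoding of an ambiguous subset by a shared identifier, are assumptions made for readability rather than a prescribed format.

    # Illustrative sketch of a type feature structure (TFS) for one MMI.
    # Field names are assumptions; the specification does not prescribe a
    # concrete encoding.
    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class TypeFeatureStructure:
        modality: str                       # e.g. "Speech", "Gesture", "Handwriting"
        mmi_type: str                       # e.g. "CreateRoute", "StreetAddress"
        turn_id: int                        # turn in which the MMI was generated
        confidence: float                   # confidence that the MMI type is correct
        attributes: Dict[str, Optional[object]] = field(default_factory=dict)
        ambiguity_id: Optional[int] = None  # shared by members of one ambiguous subset

    # Example: a handwriting MMI like the one shown in Table 1 below.
    hw_mmi = TypeFeatureStructure(
        modality="Handwriting", mmi_type="CreateRoute", turn_id=1, confidence=0.7,
        attributes={"Source": None, "mode": "quickest", "Destination": None})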

MMIs resulting from segments of the electronic signal that have been received from a modality in a turn, for which the segments of the electronic signal from which the MMIs are derived are substantially non-overlapping, and which are therefore essentially independent MMIs, are called non-ambiguous interpretations. It will be appreciated that the confidence scores of ambiguous interpretations are typically lower than the confidence scores of non-ambiguous interpretations, but this is not always so. In summary, each MMI is identified by an MMI type, a confidence score, and a turn, and may have attribute values associated with it. For the rest of this document, the analysis and manipulation of MMIs from only one turn will be discussed, so that when it is stated that the MMIs can be identified by their MMI type and MMI modality, it is implied that they are generated in one turn.

The set of multimodal interpretations 115 generated in a turn is coupled to and analyzed by a semantic classifier 120, and a set of joint multimodal interpretations (joint MMIs) 125 is generated that is uniquely more comprehensive in many instances than the sets of joint MMIs generated by prior art multimodal fusion techniques, in that more combinations may result. The semantic classifier 120 uses a set of interpretation type relationships that are stored in a domain and task model 150. Details of this aspect are described more fully below. The TFSs of the joint MMIs 125 generated by the semantic classifier 120 are then coupled to and unified by integration function 130, using techniques similar to those used in prior art systems, resulting in a set of integrated MMIs 135 that are coupled to an interaction manager 140. The interaction manager 140 generates a set of ranked MMIs 145, in a unique manner more fully described below, that are coupled to one or more applications 160. A multi-level confidence score is determined for each MMI in the set of ranked MMIs by the interaction manager using a much more comprehensive analysis than used in prior art MMI fusion techniques. The MMIs in the set of ranked MMIs 145 are preferably transferred to an application 160 in order of descending multi-level confidence scores (i.e., higher to lower overall confidence scores). Applications that are capable of doing so can use the ranked MMIs and their associated multi-level confidence scores to make a better decision as to which MMI or MMIs to use during a turn than in prior art techniques. Less capable applications can simply use the first MMI, which will be more reliably determined in many instances than in prior art techniques.

The domain and task model 150, in addition to storing the set of interpretation type relationships, stores the TFS definitions for the various MMI types and the various integrated MMI types, and stores environmental information used by the interaction manager 140, although it will be appreciated that such storage could be distributed to any extent, such as splitting it into one or more of the functions 110, 120, 130, 140. Environmental information is passed to the domain and task model 150 from an environmental inputs function 155, which may be coupled directly to environmental inputs such as a clock 170 (to capture a time of day) or coupled 107 to a modality 105 used for generating MMIs (for example, a microphone to capture a background noise level). Alternatively or in addition, the environmental inputs function 155 may be coupled 108 to a segmentation and recognition function 110, such as a sound analyzer portion of a speech analyzer, to accept a recognized environmental parameter, such as the background noise level.

Fusion of MMIs refers to the processes described above, from the point where MMIs are transferred to the semantic classifier 120 to the point where the ranked MMIs are transferred to an application 160.

Referring to Tables 1-3 and FIGS. 2-3, representations of data structures that are used during the fusion of MMIs are shown, in accordance with some embodiments of the present invention. Table 1 shows the TFSs for a set of four MMIs generated by the segmentation and recognition function 110 for the speech, gesture and handwriting modalities 105.

TABLE 1

    [ Modality: Speech
      startTime: 15:10:24.01, 15 May 2002
      endTime: 15:10:25.77, 15 May 2002
      content: ambiguous(
        [ type: CreateRoute
          Confidence: 0.7
          Reference_Order: $ref1, $ref2
          Source: $ref1(deictic,1)
          mode:
          Destination: $ref2(deictic,1) ],
        [ type: GetInfoHotel
          Confidence: 0.2
          Reference_Order: $ref1, $ref2
          Hotel: and($ref1(deictic,1), $ref2(deictic,1)) ] ) ]

    [ Modality: Gesture
      startTime: 15:10:24.43, 15 May 2002
      endTime: 15:10:24.56, 15 May 2002
      content:
        [ type: StreetAddress
          Confidence: 0.7
          Reference_Order:
          Street: 12 Lord Street
          City: Botany
          State: NSW
          Zip: 2019
          Country: Australia ] ]

    [ Modality: Handwriting
      startTime: 15:10:26.32, 15 May 2002
      endTime: 15:10:27.11, 15 May 2002
      content:
        [ type: CreateRoute
          Confidence: 0.7
          Reference_Order:
          Source:
          mode: quickest
          Destination: ] ]

In this example, a user of a multimodal map-based navigation application says, “Create a route from here to there”, makes a pointing gesture to select a location shown on a map that is displayed on a touch sensitive screen, and writes “quickest” on the screen. The spoken input is ambiguously interpreted by the segmentation and recognition function 110 as one of two tasks (i.e., two MMI types): CreateRoute and GetInfoHotel. The subset of TFSs for the two ambiguous speech MMIs is identified as such by “content: ambiguous” and by the enclosing parentheses. It can be observed that the modality, turn, and confidence score are included in the TFS for each MMI, as well as lists of attributes. The turn in this example is identified by exact start and stop times, which may be observed to be different but within four seconds of each other. This is a very typical time spread and overlapping of times within a turn. It will be observed that some attributes have explicit references that are not identified at this point in the analysis of the MMIs; for example, “Source: $ref1(deictic,1)” in the TFS for the CreateRoute type of MMI from the speech modality. Some attributes have no values at this point in the analysis of the MMIs, such as “mode:” in the TFS for the CreateRoute type of MMI from the speech modality. Some other attributes have values; for example, “mode:” in the TFS for the CreateRoute type of MMI from the handwriting modality has a value of “quickest”.

For purposes of generating joint MMIs by the semantic classifier function 120, the set of four MMIs represented by the TFSs in Table 1 may be represented more succinctly as shown in FIG. 2, in which the MMIs 200 are illustrated as being derived from the modalities 105, and wherein each MMI is assigned four identifying values. Two of the MMIs 205 are ambiguous MMIs, as noted above in Table 1. For simplicity of programming in ensuing stages of the analysis, identification numbers 210 are assigned to each non-ambiguous MMI and an identification number 210 is assigned to each MMI in each subset of ambiguous MMIs, of which there is only one subset 205 in this example. The MMI modality is identified in this and other examples herein by a two letter code 215. The MMI type is identified by a one letter code 220, and the confidence score is provided as a value 225. Letter code A indicates CreateRoute; letter code B indicates GetInfoHotel; and letter code C indicates StreetAddress. It will be appreciated that many other methods could be used to uniquely identify the MMI type, MMI modality, and confidence scores of the MMIs, as well as the subsets of ambiguous MMIs within a turn.
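
By way of illustration, the four identifying values of FIG. 2 can be captured as simple (ID, modality, type, confidence) tuples, as sketched below; the particular ID assignments are assumptions (FIG. 2 is not reproduced here), chosen so that the two ambiguous speech MMIs share an ID while the non-ambiguous MMIs each receive their own.

    # Sketch of a compact representation of the MMIs 200 of FIG. 2.
    # The ID values are assumed; members of the one ambiguous subset share ID 1.
    MMIS_200 = [
        (1, "SP", "A", 0.7),   # Speech, CreateRoute   (ambiguous subset 205)
        (1, "SP", "B", 0.2),   # Speech, GetInfoHotel  (ambiguous subset 205)
        (2, "GS", "C", 0.7),   # Gesture, StreetAddress
        (3, "HW", "A", 0.7),   # Handwriting, CreateRoute
    ]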

The semantic classifier 120 generates a set of joint MMIs 300 from the set of MMIs 200. These joint MMIs 300 are illustrated in FIG. 3 by sets of the MMIs that have been formed using a set of interpretation type relationships stored in the domain and task model 150. In FIG. 3, two joint interpretations 305, 310 have been formed. The relationships of MMIs within the joint MMI 305 are signified by the line between the circles and by the inclusion of two MMIs of the same MMI type within one circle. The joint MMI 310 is named a joint MMI even though it includes only one MMI, because it has been formed by the same process used by the semantic classifier 120 to determine other joint MMIs.

The set of joint MMIs 300 is passed to the integration function 130, which resolves the explicit references within the TFSs for each joint MMI. For joint MMI 305, the reference variable $ref1 is satisfied as a contextual reference by getting the current location from a Contextual Reference database in the domain and task model 150, and reference variable $ref2 is satisfied with the TFS of the MMI received from the gesture modality. For joint MMI 310, the reference variables are not resolved because there are no possible referents in the TFSs for joint MMI 310 and the Contextual Reference database does not provide appropriate contextual referents. The two reference variables in the TFS for MMI 310 are replaced with the ‘unresolved’ operator. After reference resolution, the TFSs are unified. For joint MMI 305, the two TFSs of the same MMI type (A) are unified using a fusion algorithm, generating a first integrated MMI, for which the unified TFS is shown in Table 2.

TABLE 2

    [ Modality: Speech, Gesture, Handwriting
      startTime: 15:10:24.01, 15 May 2002
      endTime: 15:10:27.11, 15 May 2002
      content:
        [ type: CreateRoute
          Confidence: 0.7
          Reference_Order:
          Source:
            [ type: StreetAddress
              Street: 1 Main St
              City: Bondi
              Zip: 2034 ]
          mode:
          Destination:
            [ type: StreetAddress
              Street: 12 Lord Street
              City: Botany
              State: NSW
              Zip: 2019
              Country: Australia ] ] ]

For joint MMI 310, there is only one MMI. Hence, no unification is performed and the integrated MMI is the MMI itself, for which the unified TFS is shown in Table 3.

TABLE 3

    [ Modality: Speech
      startTime: 15:10:24.01, 15 May 2002
      endTime: 15:10:25.77, 15 May 2002
      content:
        [ type: GetInfoHotel
          Confidence: 0.2
          Reference_Order:
          Hotel: and(unresolved($ref1), unresolved($ref2)) ] ]

The integrated MMIs having the TFSs shown in Tables 2 and 3 are ranked by the interaction manager 140 and transferred to the application 160.
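
A minimal sketch of the kind of recursive attribute unification performed by the integration function 130 is given below for illustration only; it assumes dictionary-valued attribute maps and assumes that reference variables such as $ref1 have already been resolved, and it is not the particular fusion algorithm of the described embodiments, which may use any known unification technique.

    # Minimal unification sketch; attribute maps are plain dictionaries and the
    # reference variables are assumed to be already resolved.  A real fusion
    # algorithm would also have to handle genuinely conflicting values.
    def unify_attributes(a: dict, b: dict) -> dict:
        unified = dict(a)
        for key, value in b.items():
            if key not in unified or unified[key] in (None, ""):
                unified[key] = value                      # fill an empty attribute
            elif isinstance(unified[key], dict) and isinstance(value, dict):
                unified[key] = unify_attributes(unified[key], value)
            # identical values are redundant and simply kept
        return unified

    speech = {"Source": {"Street": "1 Main St", "City": "Bondi"}, "mode": None, "Destination": None}
    handwriting = {"Source": None, "mode": "quickest", "Destination": None}
    print(unify_attributes(speech, handwriting))
    # {'Source': {'Street': '1 Main St', 'City': 'Bondi'}, 'mode': 'quickest', 'Destination': None}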

Referring to FIGS. 4-9, a method used by the semantic classifier 120 for generating joint MMIs from a set of MMIs generated in a turn is now described in more detail, in accordance with an example of some embodiments of the present invention. A set of MMIs 400 for this example includes six MMIs, three of which are in an ambiguous set of MMIs having an ID of 1, as shown in FIG. 4. A set of interpretation type relationships 500 is stored in the domain and task model 150, which is shown in FIG. 5 as a two dimensional matrix. This is a set of direct relationships between MMIs identified by the MMI type and the MMI modality. The first column 505 of the matrix is a set of identifiers that identify an MMI type as a letter followed (after a dash) by a two letter code that identifies the MMI modality. The first row 510 of the matrix consists of the same set of identifiers. An interpretation type relationship between a first MMI type and modality combination and a second MMI type and modality combination is specified at an intersection of a row that is identified by the first MMI type and modality combination and a column that is identified by the second MMI type and modality combination, except for intersections of the exact same MMI type and modality, in which an X is entered. The following relationships exist in some embodiments of the present invention: a SAME relationship (i.e., both types are the same, although from differing modalities) listed as “SAME”, a CONTAINED relationship listed as “CTD”, a CONTAINER relationship listed as “CTR”, a SUPER relationship listed as “SUP”, and a SUBORDINATE relationship listed as “SUB”. Where no relationship exists, a “—” is shown. The CONTAINED and CONTAINER relationships, as evidenced in this exemplary set, are reciprocal, as are the SUPER and SUBORDINATE relationships. It will be appreciated that because of the reciprocity of relationships, the matrix could be reduced to a triangular form, and it will be further appreciated that there are a variety of other ways to store the information in a computer memory.
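
The matrix 500 itself is defined by the domain and task model 150 and is not reproduced here; the sketch below shows one possible in-memory encoding of such a matrix as a lookup keyed by type-modality identifiers, with the reciprocal entries derived automatically. The few entries shown follow the example discussed below and are otherwise assumptions made for illustration.

    # One possible encoding of the interpretation type relationships 500.
    # Only a few entries are shown; they follow the example of FIGS. 7-9 and
    # are otherwise assumptions.
    RECIPROCAL = {"CTD": "CTR", "CTR": "CTD", "SUP": "SUB", "SUB": "SUP", "SAME": "SAME"}

    _relations = {
        ("A-SP", "A-HW"): "SAME",  # same type from differing modalities
        ("D-GS", "A-SP"): "CTD",   # D from gesture is contained in A from speech
        ("C-SP", "A-HW"): "SUP",   # C from speech is super to A from handwriting
    }

    def relationship(first: str, second: str) -> str:
        """Return the relationship code, or '-' when no relationship exists."""
        if (first, second) in _relations:
            return _relations[(first, second)]
        if (second, first) in _relations:
            return RECIPROCAL[_relations[(second, first)]]
        return "-"

    print(relationship("A-HW", "C-SP"))   # 'SUB', the reciprocal of SUP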

Referring to FIG. 6, a flow chart shows some steps of a method for fusion of multimodal interpretations. At step 600, a set of MMIs 115 generated within a turn is received at the semantic classifier 120. The set of MMIs 115 has been generated by the set of modalities 105. At step 605, each MMI is identified as either a non-ambiguous interpretation or an ambiguous interpretation of a subset of ambiguous interpretations of the set of MMIs. Each MMI is identified with the modality that has generated the multimodal interpretation. Each MMI is assigned an interpretation type and a confidence score.

At step 610, the semantic classifier 120 creates a set of initial joint MMI roots from the set of MMIs 115 and identifies type sets within the set of initial joint MMI roots, each type set including all non-ambiguous interpretations and subsets of ambiguous interpretations that have a common type.

At step 615, the semantic classifier 120, referring to the set of interpretation type relationships 500, removes from the set of initial joint MMI roots each MMI for which the MMI has a contained relationship with any other MMI in the set of initial joint MMI roots, except that the MMI is not removed when the MMI has a contained relationship only with one or more MMIs in a subset of ambiguous interpretations that includes the MMI (if such a subset exists).

At step 620, the semantic classifier 120, again referring to the set of interpretation type relationships 500, removes from the set of initial joint MMI roots each MMI for which the MMI has a super relationship with any other MMI in the set of initial joint MMI roots, except that the MMI is not removed when the MMI has a super relationship only with one or more MMIs in a subset of ambiguous interpretations that includes the MMI (if such a subset exists).

At step 625, the removed MMIs form a set of removed MMIs and the remaining MMIs form the set of initial joint MMI roots.

This method is completed when the semantic classifier 120 performs a process at step 630 of forming a complete set of joint MMIs 125 from the set of initial joint MMI roots and the set of removed MMIs, again using the set of interpretation type relationships.
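
The sketch below illustrates steps 615 through 625 under the tuple representation used earlier; the grouping of SAME-type MMIs into type sets at step 610 is omitted, and the relationship lookup is passed in as a function such as the relationship() sketch above. It is an illustration of the removal logic only, not a complete implementation.

    # Sketch of steps 615 (contained), 620 (super) and 625 (partition).
    def split_roots(mmis, rel):
        """mmis: (ID, modality, type, confidence) tuples for one turn;
        rel(a, b): interpretation type relationship lookup, e.g. relationship()."""
        def must_remove(mmi, candidates, wanted):
            mid, mod, mtype, _ = mmi
            for oid, omod, otype, _ in candidates:
                if oid == mid:
                    continue          # itself, or its own ambiguous subset: no removal
                if rel(f"{mtype}-{mod}", f"{otype}-{omod}") == wanted:
                    return True
            return False

        remaining, removed = list(mmis), []
        for wanted in ("CTD", "SUP"):                 # step 615, then step 620
            kept = []
            for mmi in remaining:
                (removed if must_remove(mmi, remaining, wanted) else kept).append(mmi)
            remaining = kept
        return remaining, removed                     # roots and removed MMIs (step 625)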

Referring to FIG. 7, the set of MMIs 400 is shown as it has been reorganized into a set of initial joint MMI roots 705 and a set of removed MMIs 710, using the exemplary set of interpretation type relationships 500 and the steps 610-625. In the set of removed MMIs 710, MMIs of type D and type E were removed by step 615 because they are contained, respectively, in MMI type A from the speech modality and MMI type C from the speech modality, and MMI type C was removed in step 620 because it has a super relationship to MMI type A of the handwriting modality. MMI type A of the speech modality and MMI type A of the handwriting modality form an initial joint MMI root 715, and MMI type B forms an initial joint MMI root 720.

Referring to FIG. 8, some steps of the process 630 of forming a set of joint MMIs 125 from the set of initial joint MMI roots and the set of removed MMIs are shown, in accordance with some embodiments of the present invention. The set of initial joint MMI roots is transformed by the process 630 into a set of joint MMIs by forming sets of MMI trees in which each MMI tree, when completed, represents the relationships of the MMIs within a joint MMI. An initial joint MMI root, or the initial MMI root as MMIs are added to it by the process 630, is thus referred to as an MMI tree, and the set of initial joint MMI roots is a set of MMI trees. At step 805, a loop is started in which each MMI in the set of removed MMIs is processed. An inner loop is started at step 810, in which each MMI tree in the set of MMI trees is processed. At step 815, a determination is made as to whether the MMI is in a set of ambiguous MMIs. When the MMI is in a set of ambiguous MMIs, an inner loop is started at step 820 in which each MMI within the ambiguous set is processed. As a next step of the inner loop started at step 820, or when a determination is made at step 815 that the MMI is non-ambiguous, a determination is then made at step 825 as to whether there exists any interpretation type relationship between the MMI and the root of the MMI tree being processed, using a set of interpretation type relationships such as the one described with reference to FIG. 5. When there is found to be no interpretation type relationship at step 825, then the process continues at step 865, described below. When there is found to be an interpretation type relationship at step 825, then a determination is made as to whether the interpretation type relationship is of the type “SAME” at step 830. When the interpretation type relationship is of the type “SAME”, then the MMI is added to the root of the MMI tree at step 835 and the process is continued at step 865, as described below. When the interpretation type relationship is not of the type “SAME”, then a determination is made at step 840 as to whether the interpretation type relationship is one of “SUPER” or “CONTAINED”, and when it is not, the process is continued at step 865, as described below. When the interpretation type relationship is determined to be one of “SUPER” or “CONTAINED” at step 840, a determination is made at step 845 as to whether there already exists an MMI from the same set of ambiguous interpretations in the MMI tree (this can be determined by finding whether it has the same ID as the MMI being processed). When there does not already exist an MMI from the same set of ambiguous interpretations in the MMI tree, then the MMI being processed is added to the MMI tree at step 850 and the process continues at step 865, described below. When an MMI in the MMI tree is found to be in the same set of ambiguous interpretations as the MMI being processed at step 845, a copy of the MMI tree is made and the MMI in the copy of the MMI tree is replaced by the MMI being processed at step 855, and the new MMI tree is added to the set of MMI trees at step 860.

When there is found to be no interpretation type relationship between the MMI and the MMI tree at step 825, or when the interpretation type relationship has been found to be “SAME” and the MMI has been added to the root at step 835, or when the interpretation type is found not to be “SUPER” or “CONTAINED” at step 840, or when the interpretation type has been found to be “SUPER” or “CONTAINED” and the MMI has been added to the MMI tree at step 850, or when the interpretation type has been found to be “SUPER” or “CONTAINED” and the new MMI tree has been added at step 860, a determination is made as to whether the MMI being processed is in a set of ambiguous interpretations at step 865, and when it is, the inner loop for the set of ambiguous interpretations is cycled at step 870 (i.e., a next iteration is performed or the loop is completed). When the MMI being processed is not in a set of ambiguous interpretations at step 865, or when the inner loop for the set of ambiguous interpretations is completed at step 870, then the inner loop for the set of MMI trees is cycled at step 875. When the inner loop for the set of MMI trees is completed at step 875, then the loop for removed MMIs is cycled at step 880. Note that the loop for removed MMIs is cycled so that all removed MMIs are processed until they are added onto at least one tree. Some MMIs may have to be processed through the loop more than once to be added to a tree. When the loop for removed MMIs is completed at step 880 by depletion of all MMIs from the set of removed MMIs, the process 630 to form a set of joint MMIs from the initial joint MMI roots and the removed MMIs is complete.
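
A condensed sketch of the process 630 is given below for illustration only. Each tree is flattened into a root list (SAME-type MMIs) plus a member list, so the parent/child structure of the real MMI trees is not modeled, and several per-step details of FIG. 8 are folded together; it illustrates only the looping and tree-copying behavior described above.

    # Condensed sketch of process 630.  `roots` is the set of initial joint MMI
    # roots, each already grouped as a list of SAME-type MMI tuples; `removed`
    # is the list produced by split_roots(); `rel` is the relationship lookup.
    def form_joint_mmis(roots, removed, rel):
        def rel_to_root(mmi, root):
            rels = {rel(f"{mmi[2]}-{mmi[1]}", f"{r[2]}-{r[1]}") for r in root}
            for wanted in ("SAME", "SUP", "CTD"):
                if wanted in rels:
                    return wanted
            return "-"

        trees = [{"root": list(r), "members": list(r)} for r in roots]
        pending, stalled = list(removed), 0
        while pending and stalled <= len(pending):     # guard for unplaceable MMIs
            mmi, placed = pending.pop(0), False
            for tree in list(trees):                   # copies added below are not revisited
                kind = rel_to_root(mmi, tree["root"])
                if kind == "SAME":                     # step 835: add to the root
                    tree["root"].append(mmi)
                    tree["members"].append(mmi)
                    placed = True
                elif kind in ("SUP", "CTD"):           # steps 840-860
                    clash = [m for m in tree["members"] if m[0] == mmi[0]]
                    if not clash:                      # step 850: add to this tree
                        tree["members"].append(mmi)
                    else:                              # steps 855-860: copy the tree,
                        trees.append({                 # replacing the ambiguous sibling
                            "root": [mmi if r in clash else r for r in tree["root"]],
                            "members": [m for m in tree["members"] if m not in clash] + [mmi]})
                    placed = True
            stalled = 0 if placed else stalled + 1
            if not placed:
                pending.append(mmi)                    # re-evaluated on a later pass
        return trees                                   # the set of joint MMIs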

Referring to FIG. 9, a set of joint MMIs 900 is shown that has resulted from the application of the method described with reference to FIG. 8 to the set of initial joint MMI roots 705 and the set of removed MMIs 710. At a first pass of the method, the MMI having type E is not added to any of the initial joint tree roots because no relationship type is found. At a second pass, the MMI having type D is found to have a CONTAINED type of relationship with relationship types A-SP, A-HW, and A-KB, so it is added to initial tree root 715 of FIG. 7, forming tree 905 of FIG. 9. At a third pass, the MMI having type C is found to have a SUPER relationship with relationship type A-HW, but it has the same ID as the MMI having type and modality A-SP of tree 905 (they are in the same set of ambiguous MMIs), so a new tree 915 is generated by copying tree 905 and removing the MMI having type and modality A-SP. The MMI of type C is added to tree 915. At a fourth pass, the MMI having type E is re-evaluated and is found to have a CONTAINED relationship to MMI type and modality C-SP, so it is added to tree 915. Nothing has been added to initial joint MMI root 720, so it becomes tree 910. The trees 905, 910, and 915 form the set of joint MMIs 900.

It can be seen that the processes described with reference to FIGS. 6 and 8 classify the MMIs and may generate a set of joint MMIs that comprises all possible joint MMIs, each of which is formed from one of all possible combinations of MMIs of the set of MMIs such that each joint MMI has no more than one ambiguous interpretation from each subset of ambiguous interpretations, and such that the MMIs of each joint MMI satisfy an interpretation type relationship of a defined set of interpretation type relationships based on the interpretation type and modality of each MMI. The possible combinations of MMIs may include those that have as few as one MMI in them, as illustrated by the example of FIG. 9, and can provide multiple trees on those occasions when one modality generates two independent MMIs (not illustrated in the figures). It will be appreciated that the steps of the methods shown in FIGS. 6 and 8 need not be performed in the order shown and that the methods described are but one way to accomplish the unique objective of classifying the MMIs and generating a set of joint MMIs that comprises all possible joint MMIs as defined above.

Referring now to FIG. 10, a flow chart shows a method used by the interaction manager 140 to generate a multilevel confidence score for each integrated MMI of the set of MMIs generated by the integration function 130 (FIG. 1), in accordance with some embodiments of the present invention. An integrated MMI may hereafter be referred to as an IMMI. At step 1005, a context score of an IMMI (CTXT(IMMI)) that is to be rated is determined, as will be described in more detail below. At step 1010, a content score of the IMMI (CTNT(IMMI)) is determined, as will be described in more detail below. At step 1015, a confidence score of the IMMI (CONF(IMMI)) is determined, as will be described in more detail below. At step 1020, a multilevel confidence score of the IMMI (MCONF(IMMI)) is determined based on at least one of the context score, the content score, and the confidence score of the integrated MMI. In other terms, MCONF(IMMI) = f₁(CTXT(IMMI), CTNT(IMMI), CONF(IMMI)). The factors and formulas on which the multilevel confidence scores are based are designed such that IMMIs having higher multilevel confidence scores correlate to IMMIs that are more likely to represent the attempted message. The multilevel confidence score may be based on a subset of the content, context, and confidence scores for the IMMI, but it will be appreciated that a more reliable score results when more of these factors are properly designed and included. Proper design may include the use of historical data and iterative design techniques known in the art for optimizing such a predictive function. Neural network and related techniques may also be employed. The interaction manager 140 generates the multilevel confidence scores for all the IMMIs in the set of IMMIs generated for a turn, then generates a set of ranked MMIs 145 in order of decreasing multilevel confidence scores.
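
The function f₁ is left open by the description above; a weighted combination is one plausible realization, sketched below for illustration only, with weights that are assumptions to be tuned (for example on historical data, as suggested above) or replaced by a learned model.

    # Sketch of step 1020 and of the final ranking.  The weights are assumptions.
    def multilevel_confidence(ctxt, ctnt, conf, w_ctxt=0.3, w_ctnt=0.3, w_conf=0.4):
        return w_ctxt * ctxt + w_ctnt * ctnt + w_conf * conf

    def rank_immis(scored_immis):
        """scored_immis: iterable of (immi, ctxt, ctnt, conf) tuples; returns the
        tuples ordered by decreasing multilevel confidence score."""
        return sorted(scored_immis,
                      key=lambda s: multilevel_confidence(s[1], s[2], s[3]),
                      reverse=True)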

Referring to FIG. 11, a method of generating the context score of the IMMI is shown, in accordance with some embodiments of the present invention. At step 1105, a determination is made of P_(PMMI), which is a quantity of turns ago that an integrated multimodal interpretation was generated that had a type that is related to the type of the integrated multimodal interpretation.

At step 1110, a determination is made of RS(PIMMI, IMMI), which is a type relationship score determined by a type of relationship between the IMMI and the IMMI generated P_(PMMI) turns ago.

The type relationship score is a single numerical value that is determined by a larger set of type relationships than those described for the set of interpretation type relationships illustrated in FIG. 5. The type relationship score is a function that is used for this determination and others that follow. The type relationship score includes scores for the DIRECT type relationships, but also includes scores for INDIRECT type relationships. There may be nine non-zero values (for both DIRECT and INDIRECT versions of the SUPER, SUBORDINATE, CONTAINED, and CONTAINER relationships, and for the SAME relationship), and a zero value for no type relationship. The score for SAME is typically greater than the scores for any DIRECT type relationships, and the scores for INDIRECT type relationships are typically lower than the scores for any of the DIRECT type relationships. In the examples used in this document, a type relationship score RS(MMI1, MMI2) = RS(MMI2, MMI1), although the concepts presented herein can be easily extended to a non-reciprocal relationship.

At step 1115, a determination is made of RS(IMMI, MMI(j)|j=1 to Q), which is a set of type relationship scores determined by the types of relationships between the IMMI and each of a set of IMMIs that have been predicted to be generated within the turn. The prediction may be made, for example, based on a set of message sequences appropriate for the application type to which the ranked set of IMMIs is being delivered, determined by analysis of the application and/or a history of previous (successful) message sequences supplied to the application, and a recent history of IMMIs delivered to the application by the interaction manager 140.

At step 1120, the context score of the IMMI (CTXT(IMMI)) is determined based on at least one of P_(PMMI), RS(PIMMI, IMMI), and RS(IMMI, MMI(j)|j=1 to Q). In other terms, CTXT(IMMI) = f₂(P_(PMMI), RS(PIMMI, IMMI), RS(IMMI, MMI(j)|j=1 to Q)). Preferably, f₂ is related in an inverse manner to the value of P_(PMMI) (i.e., f₂ decreases as P_(PMMI) increases, that is, as the number of previous turns increases).
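
Beyond the preference that f₂ decrease with P_(PMMI), the description leaves the form of f₂ open; the sketch below discounts the relationship score to the earlier IMMI by the number of turns elapsed and averages the scores to the predicted IMMIs, both of which are assumptions made for illustration.

    # Sketch of steps 1105-1120 (context score).  The discounting and averaging
    # are assumptions; only the inverse dependence on the turn count is stated
    # in the description above.
    def context_score(p_turns_ago, rs_previous, rs_predicted):
        """p_turns_ago: turns since a related IMMI was generated (None if never);
        rs_previous: RS(PIMMI, IMMI); rs_predicted: list of RS(IMMI, MMI(j))."""
        history = 0.0 if p_turns_ago is None else rs_previous / (1 + p_turns_ago)
        expectation = sum(rs_predicted) / len(rs_predicted) if rs_predicted else 0.0
        return history + expectation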

Referring to FIG. 12, a method of determining the content score of the IMMI is shown, in accordance with some embodiments of the present invention. At step 1205, a determination of N is made, which is the quantity of attributes included in the type feature structure of the integrated multimodal interpretation. For example, in the TFS illustrated in Table 2, there are 9 attributes listed (source street, city, and zip; mode; and destination street, city, state, zip, and country).

At step 1210, a determination is made of N_(V), a quantity of the attributes that were given at least one value in the turn. For the example illustrated in Table 2, N_(V) is 8.

At step 1215, a determination is made of N_(R), a quantity of the attributes that were given redundant values in the turn. The term redundant in this context means that, when the joint MMI was integrated, two or more TFSs of MMIs that formed the joint MMI had values for one attribute that were identical. (No example is given herein.)

At step 1220, a determination is made of N_(M), which is a quantity of the attributes that were given an explicit reference but not given a value in the turn. An example of this is illustrated in Table 3, wherein the attribute Hotel has two unresolved references, so N_(M) is 2 for this example.

At step 1225, a determination is made of CSA(i)|i=1 to N, which is a set of confidence scores, one for each attribute that was given a value in the turn. In the example illustrated by Table 2, 8 confidence scores would be determined. These may be generated by the segmentation and recognition function 110 and passed along as part of the TFS, or they may be determined from past statistics.

At step 1230, the content score of the IMMI (CTNT(IMMI)) is determined based on at least one of N, N_(V), N_(R), N_(M), and CSA(i)|i=1 to N. In other terms, CTNT(IMMI) = f₃(N, N_(V), N_(R), N_(M), CSA(i)|i=1 to N).
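
The form of f₃ is likewise left open; the sketch below rewards filled and redundant attributes, penalizes unresolved explicit references, and weights the result by the mean attribute confidence, all of which are assumptions made for illustration only.

    # Sketch of step 1230 (content score); the weights are assumptions.
    def content_score(n, n_v, n_r, n_m, attr_confidences):
        if n == 0:
            return 0.0
        coverage = n_v / n          # fraction of attributes given a value
        redundancy = n_r / n        # attributes confirmed by more than one modality
        unresolved = n_m / n        # attributes left with unresolved references
        mean_conf = (sum(attr_confidences) / len(attr_confidences)
                     if attr_confidences else 0.0)
        return max(0.0, coverage + 0.5 * redundancy - 0.5 * unresolved) * mean_conf

    # Table 2 example: content_score(9, 8, 0, 0, [the eight attribute confidences])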

Referring to FIG. 13, a method of generating the confidence score for the IMMI is shown, in accordance with some embodiments of the present invention. At step 1305, a determination is made of CSMMI(k)|k=1 to R, which is a confidence score generated for each MMI that is included in the set of MMIs that formed the IMMI. For the example illustrated in FIG. 2, this would be the confidence scores for the CreateRoute type of the speech modality, the StreetAddress type of the gesture modality, and the CreateRoute type of the handwriting modality.

At step 1310, a determination is made of CSIMMI(m)|m=1 to T, which is a set of type relationship scores determined by the type of relationship between each pair of MMIs within the set of MMIs that formed the IMMI. For the tree 905 illustrated in FIG. 9, this would be the type relationship scores for three pairs of relationships: {(1, SP, A, 0.5), (4, HW, A, 0.7)}, {(2, GS, D, 0.7), (4, HW, A, 0.7)}, and {(2, GS, D, 0.7), (1, SP, A, 0.5)}.

At step 1315, a determination is made of MODREL(m)|m=1 to T, which is a reliability score of each of the one or more modalities that generated the set of MMIs that formed the IMMI. For the tree 905 illustrated in FIG. 9, this would be reliability scores for the SP (speech), HW (handwriting), and GS (gesture) modalities. The reliability score of a modality is based on a history of confidence scores of the modality and a current environment type. Current environment types can include such parameters as background acoustical noise, location, vibration level, lighting conditions, etc.

At step 1320, the confidence score of the IMMI (CONF(IMMI)) is determined based on at least one of CSMMI(k)|k=1 to R, CSIMMI(m)|m=1 to T, and MODREL(m)|m=1 to T. In other terms, CONF(IMMI) = f₄(CSMMI(k)|k=1 to R, CSIMMI(m)|m=1 to T, MODREL(m)|m=1 to T).
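
The form of f₄ is also left open; multiplying the averages of the three factor groups, as sketched below, is an assumption made for illustration only.

    # Sketch of step 1320 (IMMI confidence score); the combination rule is an
    # assumption.  An empty factor group is treated as neutral.
    def immi_confidence(mmi_confidences, pair_relationship_scores, modality_reliabilities):
        def mean(values):
            return sum(values) / len(values) if values else 1.0
        return (mean(mmi_confidences)
                * mean(pair_relationship_scores)
                * mean(modality_reliabilities))

    # Tree 905 example: immi_confidence([0.5, 0.7, 0.7], [rs_1, rs_2, rs_3],
    #                                   [rel_sp, rel_hw, rel_gs])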

The multilevel confidence scoring can be seen to be based on up to eleven factors (steps 1105, 1110, 1115, 1205, 1210, 1215, 1220, 1225, 1305, 1310, 1315), which provides for an accurate ranking of the integrated multimodal interpretations. It will be appreciated that prior art systems might incorporate one or two factors similar to those described herein, but the present invention provides a robust and more accurate method of providing a set of ranked IMMIs to an application, thereby increasing the responsiveness of the application to the inputs and reducing the need for manual verification or manual reentry of an input. This can be crucial in some applications, such as vehicular driver advocacy applications.

The factors and formulas on which the context, content, and confidence scores are based are designed such that IMMIs having higher multilevel confidence scores correlate to IMMIs that are more likely to represent the attempted message. The multilevel confidence score may be based on a subset of the content, context, and confidence scores for the IMMI, but it will be appreciated that a more reliable score results when more of these factors are properly designed and included. Proper design may include the use of historical data and iterative design techniques known in the art for optimizing such a predictive function. Neural network and related techniques may also be employed.

The multimodal fusion technology described herein can be included in complicated systems, for example a vehicular driver advocacy system; in seemingly simpler consumer products ranging from portable music players to automobiles; in military products such as command stations and communication control systems; and in commercial equipment ranging from extremely complicated computers, to robots, to simple pieces of test equipment, just to name some types and classes of electronic equipment.

It will be appreciated that the multimodal fusion technology described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement some, most, or all of the functions described herein; as such, the functions of determining a set of joint multimodal interpretations and determining a ranked set of integrated multimodal interpretations may be interpreted as being steps of a method. Alternatively, the same functions could be implemented by a state machine that has no stored program instructions, in which each function or some combinations of certain portions of the functions are implemented as custom logic. A combination of the two approaches could be used. Thus, methods and means for performing these functions have been described herein.

In the foregoing specification, the invention and its benefits and advantages have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims.

As used herein, the terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

A “set”, as used herein, means a non-empty set (i.e., for the sets defined herein, comprising at least one member). The term “another”, as used herein, is defined as at least a second or more. The terms “including” and/or “having”, as used herein, are defined as comprising. The term “coupled”, as used herein with reference to electro-optical technology, is defined as connected, although not necessarily directly, and not necessarily mechanically. The term “program”, as used herein, is defined as a sequence of instructions designed for execution on a computer system. A “program”, or “computer program”, may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system. It is further understood that the use of relational terms, if any, such as first and second, top and bottom, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

1. A method of classifying and ranking multimodal interpretations, the method comprising: generating a set of joint multimodal interpretations from a set of multimodal interpretations generated by one or more modalities during a turn; generating a set of integrated multimodal interpretations comprising an integrated multimodal interpretation formed from each joint multimodal interpretation by unifying type feature structures of each multimodal interpretation in the joint multimodal interpretation; and generating a multilevel confidence score for each integrated multimodal interpretation based on at least one of a context score, a content score, and a confidence score of the integrated multimodal interpretation.
2. The method of classifying and ranking multimodal interpretations according to claim 1, wherein generating a set of joint multimodal interpretations from a set of multimodal interpretations generated by one or more modalities during a turn comprises: receiving a set of multimodal interpretations that are generated within a turn by a set of modalities, wherein each multimodal interpretation is one of either a non-ambiguous interpretation or an ambiguous interpretation of a subset of ambiguous interpretations of the set of multimodal interpretations, and wherein each multimodal interpretation is identified with the modality that has generated the multimodal interpretation, and wherein each multimodal interpretation is assigned an interpretation type and confidence score; and generating a set of joint multimodal interpretations comprising all possible joint multimodal interpretations, each of which is formed from one of all possible combinations of multimodal interpretations of the set of multimodal interpretations such that each joint multimodal interpretation has no more than one ambiguous interpretation from each subset of ambiguous interpretations, and such that the multimodal interpretations of each joint multimodal interpretation satisfy an interpretation type relationship of a defined set of interpretation type relationships based on the interpretation type and modality of each multimodal interpretation.
3. The method of classifying and ranking multimodal interpretations according to claim 1, wherein the context score of the integrated multimodal interpretation is generated based on at least one of: a quantity of turns ago an integrated multimodal interpretation was generated that had a type that is related to the type of the integrated multimodal interpretation; a type relationship score determined by a type of relationship between the integrated multimodal interpretation and the integrated multimodal interpretation generated the quantity of turns ago; and a set of type relationship scores determined by types of relationships between the integrated multimodal interpretation and each of a set of integrated multimodal interpretations that have been predicted to be generated within the turn.
4. The method of classifying and ranking multimodal interpretations according to claim 3, wherein the context score is inversely related to the quantity of turns.
5. The method of classifying and ranking multimodal interpretations according to claim 3, wherein the types of relationships comprise direct super, direct sub, direct contained, direct container, same, none, indirect super, indirect sub, indirect contained, and indirect container.
6. The method of classifying and ranking multimodal interpretations according to claim 1, wherein the content score of the integrated multimodal interpretation is generated based on at least one of: a quantity of attributes included in the type feature structure of the integrated multimodal interpretation; a quantity of the attributes that were given at least one value in the turn; a quantity of the attributes that were given redundant values in the turn; a quantity of the attributes that were given an explicit reference and not given a value in the turn; and a confidence score for each attribute that was given a value in the turn.
7. The method of classifying and ranking multimodal interpretations according to claim 1, wherein the confidence score for the integrated multimodal interpretation is generated based on at least one of: a confidence score generated for each multimodal interpretation that is included in a set of multimodal interpretations that formed the integrated multimodal interpretation; a set of type relationship scores determined by a type of relationship between each pair of multimodal interpretations within the set of multimodal interpretations that formed the integrated multimodal interpretation; and a reliability score of each of the one or more modalities that generated the set of multimodal interpretations that formed the integrated multimodal interpretation.
8. The method of classifying and ranking multimodal interpretations according to claim 7, wherein the reliability score of a modality is based on a history of confidence scores of the modality and a current environment type.
9-14. (canceled)